Abstract


Previous research on relation classification has verified the
effectiveness of using dependency shortest paths or subtrees. In this
paper, we further explore how to make full use of the combination of
these two kinds of dependency information. We first propose a new
structure, termed augmented dependency path (ADP), which is composed of
the shortest dependency path between two entities and the subtrees
attached to the shortest path. To exploit the semantic representation
behind the ADP structure, we develop dependency-based neural networks
(DepNN): a recursive neural network designed to model the subtrees, and
a convolutional neural network to capture the most important features
on the shortest path. Experiments on the SemEval-2010 dataset show that
our proposed method achieves state-of-the-art results.


1 Introduction 


Relation classification aims to classify the semantic relations between
two entities in a sentence. It plays a vital role in robust knowledge
extraction from unstructured texts and serves as an intermediate step
in a variety of natural language processing applications. Most existing
approaches follow a machine learning based framework and focus on
designing effective features to obtain better classification
performance.

The effectiveness of using dependency relations between entities for
relation classification has been reported in previous approaches (Bach
and Badaskar, 2007). For example, Suchanek et al. (2006) carefully
selected a set of features from tokenization and dependency parsing,
and extended some of them to generate high order features in different
ways. Culotta and Sorensen (2004) designed a dependency tree kernel and
attached more information, including part-of-speech tags and word
chunking tags, to each node in the tree. Interestingly, Bunescu and
Mooney (2005) provided an important insight that the shortest path
between two entities in a dependency graph concentrates most of the
information for identifying the relation between them. Nguyen et al.
(2007) developed these ideas by analyzing multiple subtrees with the
guidance of pre-extracted keywords. Previous work showed that the most
useful dependency information in relation classification includes the
shortest dependency path and dependency subtrees. These two kinds of
information serve different functions and their collaboration can boost
the performance of relation classification (see Section 2 for detailed
examples). However, how to uniformly and efficiently combine these two
components is still an open problem. In this paper, we propose a novel
structure named Augmented Dependency Path (ADP), which attaches
dependency subtrees to words on a shortest dependency path, and we
focus on exploring the semantic representation behind the ADP
structure.


Recently, deep learning techniques have been widely used in modeling
complex structures. This provides us an opportunity to model the ADP
structure in a neural network framework. Thus, we propose a
dependency-based neural network where two sub-neural networks are used
to model shortest dependency paths and dependency subtrees
respectively. One convolutional neural network (CNN) is applied over
the shortest dependency path, because a CNN is suitable for capturing
the most useful features in a flat structure. A recursive neural
network (RNN) is used for extracting semantic representations from the
dependency subtrees, since an RNN is good at modeling hierarchical
structures. To connect these two sub-networks, each word on the
shortest path is combined with a representation generated from its
subtree, strengthening the semantic representation of the shortest
path. In this way, the augmented dependency path is represented as a
continuous semantic vector which can be further used for relation
classification.

Figure 1: Sentences and their dependency trees.
  S1: A thief who tried to steal the truck broke the ignition with screwdriver.
  S2: On the Sabbath the priests broke the commandment with priestly work.

Figure 2: The bold part is the shortest path between two entities in the
undirected version of the dependency tree, and some subtrees are
attached to it. Together they form an augmented dependency path.

The major contributions of the work presented in this paper are as
follows.

1. We extend the shortest dependency path into the augmented dependency
path to better model the relation between two entities.

2. We propose a dependency-based neural network, DepNN, to model the
augmented dependency path. It combines the advantages of the
convolutional neural network and the recursive neural network.

3. We conduct extensive experiments on the SemEval-2010 dataset and the
experimental results show that DepNN outperforms baseline methods and
yields a state-of-the-art F1 measure on the relation classification
task.

2 Problem Definition and Motivation 

The task of relation classification can be defined as follows. Given a
sentence S with a pair of entities $e_1$ and $e_2$ annotated, the task
is to identify the semantic relation between $e_1$ and $e_2$ in
accordance with a set of predefined relation types. According to the
official guideline of SemEval-2010 task 8 (Hendrickx et al., 2010),
there are 9 ordered relation types. We list them in Table 1 with their
simplified definitions. Instances that do not fall into any of these
types are labeled as Other. For example, in Figure 2, the relation
between the two entities $e_1$=thief and $e_2$=screwdriver is
Instrument-Agency.

Relation Type        Definition
Cause-Effect         X is the cause of Y
Entity-Origin        Y is the origin of an entity X, and X is coming
                     or derived from that origin
Message-Topic        X is a communicative message containing
                     information about Y
Product-Producer     X is a product of Y
Entity-Destination   Y is the destination of X in the sense of X
                     moving toward Y
Member-Collection    X is a member of Y
Instrument-Agency    X is the instrument (tool) of Y or Y uses X
Component-Whole      X has an operating or usable purpose within Y
Content-Container    X is or was stored or carried inside Y

Table 1: Relation types of (X, Y) and their definitions in SemEval-2010
task 8.


Bunescu and Mooney (2005) reported that, for the relation
classification task, the shortest dependency path between two entities
plays a vital role. They pointed out that this kind of path can capture
the predicate-argument sequences, providing helpful information for
relation classification. For example, in Figure 2a the shortest path
includes the structure “broke prep_with screwdriver”, which helps in
judging the Instrument-Agency relation.

Figure 3: Illustration of dependency-based neural networks.

Although the shortest dependency paths prove useful for relation
classification, there is other information on the dependency tree that
can be exploited to represent the relation more precisely. For example,
Figures 2a and 2b show two instances which have similar shortest
dependency paths but belong to different relation types. In this
situation, if we only use the shortest dependency paths for judging
relation types, it is difficult to distinguish these two instances.
However, we notice that the subtrees attached to the shortest
dependency paths, such as “dobj→commandment” and “dobj→ignition”, can
provide supplemental information for relation classification. Based on
many observations like this, we propose to employ these subtrees and
combine them with the shortest path to form a more precise structure
for classifying relations. This combined structure is called the
“augmented dependency path (ADP)”, as illustrated in Figure 2.
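The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the edge list is a hand-written, assumed parse of sentence S1 (a real system would obtain it from a dependency parser), and the names `shortest_path` and `attached_subtrees` are hypothetical helpers. The sketch finds the shortest path between the two entities in the undirected tree and then collects, for each word on the path, the child edges that leave the path.

```python
from collections import deque

# Hand-written, illustrative parse of S1 as (head, relation, dependent).
# The two occurrences of "the" are given distinct ids.
edges = [
    ("broke", "nsubj", "thief"), ("thief", "det", "A"),
    ("thief", "rcmod", "tried"), ("tried", "nsubj", "who"),
    ("tried", "xcomp", "steal"), ("steal", "aux", "to"),
    ("steal", "dobj", "truck"), ("truck", "det", "the_1"),
    ("broke", "dobj", "ignition"), ("ignition", "det", "the_2"),
    ("broke", "prep_with", "screwdriver"),
]

def shortest_path(edges, start, goal):
    """Breadth-first search over the undirected version of the tree."""
    neighbours = {}
    for head, _, dep in edges:
        neighbours.setdefault(head, []).append(dep)
        neighbours.setdefault(dep, []).append(head)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in neighbours.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    # Walk back from the goal to the start to recover the path.
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def attached_subtrees(edges, path):
    """For each word on the path, list the child edges that leave the path."""
    on_path = set(path)
    return {w: [(rel, dep) for head, rel, dep in edges
                if head == w and dep not in on_path]
            for w in path}

path = shortest_path(edges, "thief", "screwdriver")
adp = attached_subtrees(edges, path)
print(path)          # ['thief', 'broke', 'screwdriver']
print(adp["broke"])  # [('dobj', 'ignition')]
```

Only the direct child edges are listed here; the full subtree under each child (for example “ignition” with its determiner) is reached by following the same edge list recursively.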

Next, our goal is to capture the semantic representation of the ADP
structure between two entities. The key problem here is how to combine
the two components of the ADP to incorporate more information. We
propose that on the augmented dependency path, a word should be
represented by both itself and its attached subtree. This is because
the word itself contains its general meaning while the subtree can
provide semantic information about how this word functions in this
specific sentence. With this idea, we adopt the recursive neural
network (RNN), which has proved suitable for modeling hierarchical
structures, to build semantic embeddings for the words on the shortest
path along with their subtrees. After obtaining these more precise word
representations, a convolutional neural network (CNN) can be applied,
since it is good at modeling flat structures and can generate a
fixed-size vector containing the most relevant features.


3 Dependency-Based Neural Networks 

In this section, we will introduce how we use neural network techniques
and dependency information to explore the semantic connection between
two entities. We name our architecture for modeling ADP structures
dependency-based neural networks (DepNN). Figure 3 illustrates DepNN
with a concrete example. First, we associate each word w and dependency
relation r with a vector representation $x_w, x_r \in \mathbb{R}^{dim}$.
For each word w on the shortest dependency path, we develop an RNN from
its leaf words up to the root to generate a subtree embedding $c_w$ and
concatenate $c_w$ with $x_w$ to serve as the final representation of w.
Next, a CNN is designed to model the shortest dependency path based on
the representation of its words and relations. Finally, our framework
can efficiently represent the semantic connection between two entities
with consideration of more comprehensive dependency information.

3.1 Modeling Dependency Subtree 

The goal of modeling dependency subtrees is to find an appropriate
representation for the words on the shortest path. As mentioned above,
we assume that each word w can be interpreted by itself and its
children on the dependency subtree. Then, for each word w on the
subtree, its word embedding $x_w \in \mathbb{R}^{dim}$ and subtree
representation $c_w \in \mathbb{R}^{dim_c}$ are concatenated to form
its final representation $p_w \in \mathbb{R}^{dim + dim_c}$. For a word
that does not have a subtree, we set its subtree representation as
$c_{LEAF}$. The subtree representation of a word is derived through
transforming the representations of its children words. During the
bottom-up construction of the subtree, each word is associated with a
dependency relation such as dobj as in Figure 3.

Take the ADP in Figure 3 for example. We first compute the leaves'
representations, such as $p_{the}$:

$$p_{the} = [x_{the}, c_{LEAF}] \tag{1}$$

Once all leaves are finished, we move to interior nodes whose children
have already been processed. In the example, continuing from “the” to
its parent, “Sabbath”, we compute

$$p_{Sabbath} = [x_{Sabbath}, c_{Sabbath}] \tag{2}$$

$$c_{Sabbath} = f(W_{det} \cdot p_{the} + b) \tag{3}$$

where f is a non-linear activation function such as tanh, $W_{det}$ is
the transformation matrix associated with dependency relation det and b
is a bias term. We repeat this process until we reach the root on the
shortest path, which in this case is “broke”:

$$p_{broke} = [x_{broke}, c_{broke}]$$

$$c_{broke} = f(W_{prep\_on} \cdot p_{Sabbath} + W_{dobj} \cdot p_{commandment} + b)$$
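The bottom-up computation for “the”, “Sabbath” and “broke” can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the dimensions are toy values, all embeddings and relation matrices are random placeholders rather than trained parameters, $c_{LEAF}$ is taken to be a zero vector (the text does not say how it is set), and `subtree_repr` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, DIM_C = 4, 3  # toy sizes for dim and dim_c

# Placeholder parameters: random word embeddings x_w, one matrix W_r per
# dependency relation, a shared bias b and the leaf vector c_LEAF.
words = ["the", "Sabbath", "commandment", "broke"]
x = {w: rng.normal(size=DIM) for w in words}
W = {r: rng.normal(size=(DIM_C, DIM + DIM_C))
     for r in ["det", "prep_on", "dobj"]}
b = np.zeros(DIM_C)
c_leaf = np.zeros(DIM_C)  # assumption: c_LEAF is a zero vector

# Subtree attached to "broke": word -> list of (relation, child).
children = {
    "broke": [("prep_on", "Sabbath"), ("dobj", "commandment")],
    "Sabbath": [("det", "the")],
    "commandment": [("det", "the")],
    "the": [],
}

def subtree_repr(word):
    """Return p_w = [x_w, c_w], computing c_w from the children of w."""
    kids = children[word]
    if not kids:
        c = c_leaf
    else:
        # Sum the relation-specific transforms of the children's p vectors.
        total = sum(W[rel] @ subtree_repr(child) for rel, child in kids)
        c = np.tanh(total + b)
    return np.concatenate([x[word], c])

p_broke = subtree_repr("broke")
print(p_broke.shape)  # (7,)
```

The recursion visits leaves first, so every child vector is available before its parent's subtree representation is formed, which mirrors the bottom-up order described in the text.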


The composition equation for any word w with children Children(w) is

$$c_w = f\Big(\sum_{q \in Children(w)} W_{R(w,q)} \cdot p_q + b\Big) \tag{4}$$

$$p_q = [x_q, c_q] \tag{5}$$

where $R(w,q)$ denotes the dependency relation between word w and its
child word q. This process continues recursively from leaves up to the
root words on the shortest path. Each of these words will have a vector
representation after this stage ($p_{priests}$, $p_{broke}$ and
$p_{work}$ in this example).

3.2 Modeling Shortest Dependency Path

To classify the relation between two entities, we further explore the
semantic representation behind their shortest dependency path, which
can be seen as a sequence of words interspersed with dependency
relations. Take the shortest dependency path in the last subsection for
example. The sequence S will be

        w1       r1     w2     r2         w3
    S: [priests  nsubj  broke  prep_with  work]

As the convolutional neural network (CNN) is good at capturing the
salient features from a sequence of objects, we design a CNN to tackle
the shortest dependency path.

A CNN contains a convolution operation over windows of object
representations, followed by a pooling operation. As we know, a word w
on the shortest path is associated with the representation $p_w$
through modeling the subtree. For a dependency relation r on the
shortest path, we set its representation as a vector
$x_r \in \mathbb{R}^{dim}$. As a sliding window is applied on the
sequence, we set the window size as k. For example, when k = 3, the
sliding windows of S are
$\{[r_s\ w_1\ r_1], [r_1\ w_2\ r_2], [r_2\ w_3\ r_e]\}$, where $r_s$
and $r_e$ are used to denote the beginning and end of a shortest
dependency path between two entities.

We concatenate k neighboring word (or dependency relation)
representations within one window into a new vector. Assume
$X_i \in \mathbb{R}^{dim \cdot k + dim_c \cdot n_w}$ is the
concatenated representation of the i-th window, where $n_w$ is the
number of words in one window. A convolution operation involves a
filter $W_1 \in \mathbb{R}^{l \times (dim \cdot k + dim_c \cdot n_w)}$,
which operates on $X_i$ to produce a new feature vector $L_i$ with l
dimensions,

$$L_i = W_1 X_i \tag{6}$$

where the bias term is ignored for simplicity.

Then $W_1$ is applied to each possible window in the shortest
dependency path to produce a feature map: $[L_0, L_1, L_2, \cdots]$.
Next, we adopt the widely-used max-over-time pooling operation
(Collobert et al., 2011), which can retain the most important features,
to obtain the final representation L from the feature map. That is,
$L = \max(L_0, L_1, L_2, \ldots)$.

By this means, we are able to obtain the semantic representation of the
ADP with the advantages of both RNN and CNN.
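The convolution and pooling steps can be illustrated with a small NumPy sketch. It is not the authors' code: the sizes are toy values, all vectors are random placeholders, and the step of two positions between windows is an assumption inferred from the three windows listed for k = 3 (each window then starts at a relation slot and contains exactly one word).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, DIM_C, L_SIZE, K = 4, 3, 5, 3  # toy dim, dim_c, l and window size k

# Path S = [priests nsubj broke prep_with work], padded with r_s and r_e.
# Words carry p_w (DIM + DIM_C values); relations carry x_r (DIM values).
tokens = ["r_s", "priests", "nsubj", "broke", "prep_with", "work", "r_e"]
is_word = [False, True, False, True, False, True, False]
vecs = [rng.normal(size=DIM + DIM_C if w else DIM) for w in is_word]

# Windows [r_s w1 r1], [r1 w2 r2], [r2 w3 r_e]: start at every second slot.
windows = [np.concatenate(vecs[i:i + K])
           for i in range(0, len(vecs) - K + 1, 2)]

n_w = 1  # number of words per window when K = 3
W1 = rng.normal(size=(L_SIZE, DIM * K + DIM_C * n_w))  # the filter W_1

feature_map = np.stack([W1 @ X for X in windows])  # row i is L_i = W_1 X_i
L = feature_map.max(axis=0)  # max-over-time pooling, one value per feature
print(len(windows), L.shape)  # 3 (5,)
```

Because the maximum is taken over windows, L has a fixed length l regardless of how long the shortest path is.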



3.3 Learning 

Like other relation classification systems, we also incorporate some
lexical-level features which have proved useful for this task. These
include named entity tags and WordNet hypernyms of $e_1$ and $e_2$. We
concatenate them with the ADP representation L to produce a combined
vector M. We then pass M to a fully connected softmax layer whose
output is the probability distribution y over relation labels.

$$M = [L, LEX] \tag{7}$$

$$y = softmax(W_2 M) \tag{8}$$

We define the ground-truth label vector t for each instance as a binary
vector. If the instance belongs to the i-th type, only $t_i$ is 1 and
the other dimensions are set to 0. To learn the parameters, we optimize
the cross-entropy error between y and t using stochastic gradient
descent (Bottou, 2004). For each training instance, we define the
objective function as:

$$\min_{\theta} \Big( -\sum_{j} t_j \log(y_j) \Big) \tag{9}$$

where $\theta$ represents the parameters. Gradients are computed using
backpropagation (Rumelhart et al., 1988).

                      Frequency
Relation              Train            Test
Other                 1410 (17.63%)    454 (16.71%)
Cause-Effect          1003 (12.54%)    328 (12.07%)
Component-Whole        941 (11.76%)    312 (11.48%)
Entity-Destination     845 (10.56%)    292 (10.75%)
Product-Producer       717 (8.96%)     231 (8.50%)
Entity-Origin          716 (8.95%)     258 (9.50%)
Member-Collection      690 (8.63%)     233 (8.58%)
Message-Topic          634 (7.92%)     261 (9.61%)
Content-Container      540 (6.75%)     192 (7.07%)
Instrument-Agency      504 (6.30%)     156 (5.74%)
Total                 8000 (100%)     2717 (100.00%)

Table 2: Statistics of SemEval-2010 dataset.
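The classification layer and the per-instance objective of Section 3.3 can be sketched as below. This is a minimal illustration with assumed toy sizes and random placeholder values for L, LEX and $W_2$; it is not the authors' training code. The update uses the standard result that the gradient of the cross-entropy with respect to the pre-softmax scores is y - t.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, t):
    """Per-instance objective: -sum_j t_j log(y_j)."""
    return -np.sum(t * np.log(y))

rng = np.random.default_rng(0)
N_LABELS = 4                      # toy number of relation labels
L_vec = rng.normal(size=5)        # stands in for the ADP representation L
LEX = rng.normal(size=2)          # stands in for the lexical features
M = np.concatenate([L_vec, LEX])  # M = [L, LEX]
W2 = rng.normal(size=(N_LABELS, M.size))

t = np.zeros(N_LABELS)
t[2] = 1.0                        # one-hot ground-truth label vector

y = softmax(W2 @ M)
loss = cross_entropy(y, t)

# One stochastic gradient descent step on W2 only.
lr = 0.05
W2_new = W2 - lr * np.outer(y - t, M)
new_loss = cross_entropy(softmax(W2_new @ M), t)
print(loss > new_loss)  # True
```

In a full model the same gradient would be propagated further back into the CNN filter, the relation matrices and the embeddings.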


4 Experiments

Our experiments are performed on the SemEval-2010 dataset (Hendrickx et
al., 2010). The training part of the dataset includes 8000 instances,
and the test part includes 2717 instances. Table 2 shows the statistics
of the annotated relation types of this dataset. We can see that the
distribution of relation types in the test set is similar to that in
the training set. The official evaluation metric is the macro-averaged
F1-score (excluding Other). We use dependency trees generated by the
Stanford Parser (Klein and Manning, 2003) with the “collapsed” option,
which regards a preposition as a kind of dependency relation. As de
Marneffe and Manning (2008) pointed out, this option is more useful for
event relation extraction.

4.1 Analysis of DepNN

4.1.1 Contributions of different components

We first show the contributions from different components of DepNN. In
our experiments, two kinds of word embeddings are used for
initialization. One is the 50-d embeddings provided by SENNA (Collobert
et al., 2011). The second is the 200-d embeddings (Yu et al., 2014)
trained on Gigaword with word2vec¹. The corresponding hyperparameters
are set with 5-fold cross validation, including window size k, learning
rate $\lambda$, subtree embedding's dimension $dim_c$, and hidden layer
size l. The final settings are shown in Table 3.

Embeddings   k   λ      dim_c   l
50-d         5   0.05   25      200
200-d        5   0.05   100     400

Table 3: Hyperparameter settings.

¹ https://code.google.com/p/word2vec/

For evaluation, we first design a relation extraction system (named
PATH) which only models the shortest dependency path with a CNN. Based
on PATH, we incorporate the two kinds of lexical features, named entity
tags (NER) and WordNet hypernyms (WN), which gives two systems named
PATH+NER and PATH+WN respectively. We also add the attached subtrees
(SUB) modeled by an RNN to form the complete augmented dependency path.

            F1
Model       50-d    200-d
PATH        80.3    81.8
PATH+WN     80.8    82.0
PATH+NER    81.1    82.4
PATH+SUB    81.2    82.8

Table 4: Performance of DepNN with different components.

From Table 4 we can verify the effectiveness of modeling the shortest
dependency path with a CNN, since PATH can achieve a relatively high
result. The experimental results also indicate that both the NER and
WordNet features can improve the performance of relation extraction.
WordNet seems less useful than NER, which conforms to the results of Yu
et al. (2014), since a large number of WordNet hypernyms may cause
overfitting. Furthermore, the attached subtrees, as we expect, can
provide an obvious boost to DepNN. The NER tags, WordNet hypernyms and
subtrees all contribute to the performance by providing supplemental
information for words on the shortest path. The experiments show that
the subtree information does a better job than the other two kinds of
information and can help build more precise representations for words
in a sentence. To get a deeper understanding of what semantic
information can be captured behind the ADP structure, we will look into
our model and analyze it with specific examples. Since the Gigaword
embeddings, with their larger corpus and dimensions, can significantly
improve the classification performance, the following experiments and
analysis are all based on Gigaword embeddings.


4.1.2 Intuitive Analysis of Shortest Path 

We take the output vector of the CNN layer as the distributed
representation of a dependency path. In this way, we can calculate the
cosine similarity between any two paths and illustrate some paths with
high similarity. Table 5 shows three training instances with different
relation types and their three most similar paths in the test set.

From Table 5, we can see that our approach can capture the core meaning
of the shortest dependency paths. For example, for the
Instrument-Agency relation, we infer that the dependency relations
“nsubj_inv”, “dobj” and “prep_with” in the dependency path play a main
role in the representation and our model can capture these similar
paths. For the Product-Producer relation, our model focuses on
representing the structure “nsubj_inv verb1 xcomp verb2 dobj” and
exploits some words like “pencil” and “create” in the path
representation. This is clearer for the Message-Topic relation, where
the similarity of words like “point”, “explore”, “address” and “relate”
is well learned.
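The nearest-neighbour lookup behind this analysis amounts to ranking paths by cosine similarity. A minimal sketch follows, assuming made-up 3-dimensional vectors in place of real CNN outputs; `cosine` and `nearest` are hypothetical helper names and the path labels are placeholders.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(query, candidates, top_n=3):
    """Return the names of the top_n candidates most similar to query."""
    ranked = sorted(candidates,
                    key=lambda name: cosine(query, candidates[name]),
                    reverse=True)
    return ranked[:top_n]

# Made-up vectors standing in for CNN outputs of dependency paths.
query = np.array([1.0, 0.2, 0.0])
test_paths = {
    "path_a": np.array([2.0, 0.4, 0.0]),   # parallel to the query
    "path_b": np.array([0.0, 1.0, 0.0]),
    "path_c": np.array([1.0, 1.0, 0.0]),
    "path_d": np.array([-1.0, 0.0, 0.3]),  # points away from the query
}
print(nearest(query, test_paths, top_n=3))  # ['path_a', 'path_c', 'path_b']
```

Cosine similarity ignores vector length, so a path whose representation is a scaled copy of the query ranks first.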


Instrument-Agency
  master nsubj_inv teaches dobj lesson prep_with stick
  analyzer prep_of_inv core nsubj_inv identifies dobj paths vmod using dobj method
  architect nn_inv measures dep Sage prep_with strip
  shop nsubj_inv fixed prep_with method

Product-Producer
  factory nsubj_inv began xcomp manufacture dobj banduras
  designer nsubj_inv made dobj sets
  writer rcmod pencilled dobj storyboard
  student nsubj_inv spent xcomp creating dobj application

Message-Topic
  article prep_in_inv explores dobj Impulsivity
  article rcmod pointed dobj problems
  speech vmod addressing dobj practices
  chapter nsubj_inv relates dobj attempts

Table 5: Shortest dependency paths and their closest neighbours in the
learned feature space.

4.1.3 Influence of Attached Subtree 

In this subsection, we discuss the role of the attached subtree (SUB)
in relation classification. By comparing the results of DepNN before
and after adding the subtree, we find that the influence of this
structure varies across relation types. Table 6 shows the F1 measure of
each relation type before and after adding the subtree.


                      F1
Relation              No SUB    With SUB    Change
Component-Whole       0.805     0.812       0.007
Instrument-Agency     0.683     0.714       0.031
Member-Collection     0.818     0.829       0.011
Cause-Effect          0.881     0.89        0.009
Entity-Destination    0.862     0.869       0.007
Content-Container     0.826     0.828       0.002
Message-Topic         0.854     0.856       0.002
Product-Producer      0.776     0.801       0.025
Entity-Origin         0.853     0.857       0.004

Table 6: Influence of the subtrees on each relation type.
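As a consistency check, the per-relation values in Table 6 can be macro-averaged directly. The sketch below simply averages the nine reported F1 values; it is not the official SemEval scorer, which works from the predictions themselves. Under that simplification, the averages agree with the 200-d PATH and PATH+SUB entries of Table 4.

```python
# Per-relation F1 values copied from Table 6.
no_sub = {
    "Component-Whole": 0.805, "Instrument-Agency": 0.683,
    "Member-Collection": 0.818, "Cause-Effect": 0.881,
    "Entity-Destination": 0.862, "Content-Container": 0.826,
    "Message-Topic": 0.854, "Product-Producer": 0.776,
    "Entity-Origin": 0.853,
}
with_sub = {
    "Component-Whole": 0.812, "Instrument-Agency": 0.714,
    "Member-Collection": 0.829, "Cause-Effect": 0.89,
    "Entity-Destination": 0.869, "Content-Container": 0.828,
    "Message-Topic": 0.856, "Product-Producer": 0.801,
    "Entity-Origin": 0.857,
}

def macro_f1(per_relation):
    """Unweighted mean of the per-relation F1 values."""
    return sum(per_relation.values()) / len(per_relation)

# Relation type with the largest improvement after adding subtrees.
largest_gain = max(no_sub, key=lambda r: with_sub[r] - no_sub[r])

print(round(100 * macro_f1(no_sub), 1))    # 81.8
print(round(100 * macro_f1(with_sub), 1))  # 82.8
print(largest_gain)                        # Instrument-Agency
```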

We can see that the subtree information generally has a positive impact
on all the relation types. It is especially salient for the
Instrument-Agency and Product-Producer relations. When only the
shortest dependency paths are used, these two relation types are easily
confused, as they both rely on dependency paths such as “... verb
prep_by/prep_with/using ...”. But after considering the subtree
information, we can better distinguish these two relation types. Figure
4 lists two instances that can be classified correctly only after
adding the subtrees. Figure 4a belongs to the Product-Producer
relation, which can be reflected by subtree structures like
“conj_and→valves” and “amod→manufacturing”. Figure 4b belongs to the
Instrument-Agency relation, and the subtree structure attached to the
word “scaled” provides more supplemental information to the shortest
path as explained above.

Figure 4: ADP of instances that can be classified correctly after
adding the subtrees.

                                                      F1
Model     Additional Features (AF)                    with AF    without AF
SVM       POS, prefixes, PropBank, Google n-gram,     82.2       -
          NomLex-Plus, Levin classes, WordNet,
          dependency parse, morphological,
          FrameNet, TextRunner, paraphrases
MV-RNN    POS, NER, WordNet                           81.8²      78.2
CNN       WordNet                                     82.7       79.2
FCM       NER                                         83.0       82.2
DT-RNN    NER                                         73.1       72.1
DepNN     NER                                         83.6       82.8

Table 7: Results of evaluation on the SemEval-2010 dataset.


4.2 Comparison with Baselines 


In this subsection, we compare DepNN with several baseline approaches
to relation classification.

SVM (Rink and Harabagiu, 2010): This is the top-performing system in
SemEval-2010. It depends on human-compiled feature templates and
utilizes many external corpora to extract features for an SVM
classifier.

MV-RNN (Socher et al., 2012): This model associates each word with a
matrix. Based on the constituent parse tree structure, this model finds
the path between two entities and learns the distributed representation
of their highest parent node through the composition in a recursive
neural network.

DT-RNN (Socher et al., 2014): This model uses an RNN for modeling
dependency trees. It assigns a composition matrix to each dependency
relation. Different from our model DepNN, the embedding of each node is
a linear combination of its children. The network is trained using the
method provided by Iyyer et al. (2014). We average the learned vectors
of all nodes, stack the average with the root node's embedding and
additional features, and feed them into a softmax classifier.

CNN: Zeng et al. (2014) build a convolutional model to learn a sentence
representation over the words in a sentence. To represent each word,
they use a special position vector to indicate the relative distances
of the current input word to the two marked entities, concatenating the
position vector with the corresponding word embedding. Then the
sentence representation is stacked with some lexical features and fed
into a softmax classifier.

FCM (Yu et al., 2014): FCM decomposes a sentence into substructures and
learns a substructure embedding from each of them. Then the
substructure embeddings in a sentence are combined via a sum-pooling
operation and put into a softmax classifier.

² MV-RNN achieves a higher F1-score (82.7) on SENNA embeddings reported
in the original paper.


Table 7 compares DepNN with the baseline approaches. Since many of our
baselines are neural network models, it is convenient for them to use
some features extracted with external resources or tools to enhance
performance. We call these features “additional features” (AF) and list
them in the second column. The F1-measures on the SemEval-2010 dataset
with and without these additional features are shown in the last two
columns.

From Table 7, we can see that DepNN achieves the best result (83.6)
with the NER features. SVM achieves a comparable result, though the
quality of feature engineering highly relies on human experience and
external NLP resources. MV-RNN models the constituent parse trees with
a recursive procedure and its F1-measures with and without AF are about
1.7 percent and 4.6 percent lower than those of DepNN. This to some
extent indicates that our proposed ADP structure is more suitable for
the relation classification task. Meanwhile, MV-RNN is very slow to
train, since each word is associated with a matrix. Both CNN and FCM
use features from the whole sentence and achieve similar performance.
DT-RNN is the worst of all baselines, though it also considers the
information from shortest dependency paths and attached subtrees. As
analyzed above, shortest dependency paths and subtrees play different
roles in relation classification. However, DT-RNN does not distinguish
the modeling processes of shortest paths and subtrees, and treats the
representation of each node as a linear combination of its children.


5 Related Work 


Relation classification is one traditional subproblem of Information
Extraction (IE). It aims to detect and classify relations between the
predefined types of objects in the corpus. These objects could be named
entities or marked nominals³. Much research has been performed in this
field, most of which considers it as a supervised multi-classification
task. Depending on the input to the classifier, these approaches can be
further divided into feature-based, tree kernel-based and composite
kernel-based.

Feature-based methods extract various kinds of linguistic features,
including both syntactic features and semantic cues. These features are
combined to form a feature vector employed in a Max Entropy (Kambhatla)
or an SVM (Zhou et al., 2005; GuoDong et al., 2005) classifier.
Feature-based methods usually need hand-crafted features and lack the
ability to represent structural information (e.g., parsing tree, word
order).

Kernel methods use a more natural way of exploring structural features
by computing the inner product of two objects in the high-dimensional
latent feature space. Zelenko et al. (2003) designed a tree kernel to
compute the structural commonality between shallow parse trees by a
weighted sum of the number of common subtrees. Culotta and Sorensen
(2004) transferred this kernel to a dependency tree and attached more
information, including POS tags and word chunk tags, to each node. Zhou
et al. (2007) proposed a context-sensitive convolution tree kernel that
used context information beyond the local tree. In another view,
Bunescu and Mooney (2005) provided an important insight that the
shortest path between the two entities concentrates most of the
information for identifying the relation between them. Nguyen et al.
(2007) used the dependency subtrees in a different manner by modeling
the subtrees between entities and keywords of certain relations. Zhang
et al. (2006) further proposed composite kernels to combine a tree
kernel and a feature-based kernel to promote the performance.

³ ACE Evaluation uses the named entities while the SemEval evaluation is
based on nominals.

Recently, Deep Neural Networks (DNN) have been developed to solve the
relation classification problem. By associating each word with a
distributed representation, DNN can overcome the sparsity problem in
traditional methods and automatically learn appropriate features.
Socher et al. (2012) proposed a recursive neural network model by
constructing compositional semantics for the minimal constituent of a
constituent parse tree including both marked entities. Zeng et al.
(2014) used a convolutional neural network over the whole sentence
combined with some lexical features. They also pointed out that the
position of each word in the sentence is very important for relation
classification and concatenated a special position feature vector with
the corresponding word embedding. Yu et al. (2014) proposed the
Factor-based Compositional Embedding Model, which extracted features
from the substructures of a sentence and combined them through a
sum-pooling layer.


6 Conclusion 

In this paper, we propose to classify relations between entities by
modeling the augmented dependency path in a neural network framework.
For a given instance, we generate its ADP by combining the shortest
path between two entities and the attached subtrees. We present a novel
approach, DepNN, which takes advantage of both a convolutional neural
network and a recursive neural network to model this structure.
Experimental results demonstrate that DepNN achieves state-of-the-art
performance.